Several methods are discussed that construct a finite automaton given a context-free
grammar, including both methods that lead to subsets and those that lead to supersets of
the original context-free language. Some of these methods of regular approximation are
new, and some others are presented here in a more refined form with respect to existing
literature. Practical experiments with the different methods of regular approximation
are performed for spoken-language input: hypotheses from a speech recognizer are filtered
through a finite automaton.



1 Introduction 



Several methods of regular approximation of context-free languages have been proposed 
in the literature. For some, the regular language is a superset of the context-free language, 
and for others it is a subset. We have implemented a large number of methods, and where
needed we refined them with an analysis of the grammar. We also propose a number of
new methods.

The analysis is based on a sufficient condition for context-free grammars to generate
regular languages. For an arbitrary grammar, this analysis identifies sets of rules that
need to be processed in a special way in order to obtain a regular language. The nature
of this processing differs for the respective approximation methods. For other parts of
the grammar, no special treatment is needed and the grammar rules are translated to
states and transitions of a finite automaton without affecting the language.

Few of the published articles on regular approximation have discussed the application 
in practice. In particular, little attention has been given to the following two questions. 
First, what happens when a context-free grammar grows in size? What is then the in- 
crease of the sizes of the intermediate results and the obtained minimal deterministic 
automaton? Second, how "precise" are the approximations? That is, how much larger 
than the original context-free language is the language obtained by a superset approx- 
imation, and how much smaller is the language obtained by a subset approximation? 
(How we measure the "sizes" of languages in a practical setting will become clear in the 
sequel.) 

Some considerations with regard to theoretical upper bounds on the sizes of the
intermediate results and the finite automata have already been discussed by Nederhof
(1997). In this article we will try to answer the above two questions in a practical setting,
using practical linguistic grammars and sentences taken from a spoken-language corpus.

The structure of this paper is as follows. In Section 2 we recall some standard
definitions from language theory. Section 3 investigates a sufficient condition for a
context-free grammar to generate a regular language. We also present the construction of
a finite automaton from such a grammar.

In Section 4, we discuss several methods to approximate the language generated by a
grammar if the sufficient condition mentioned above is not satisfied. These methods can
be enhanced by a grammar transformation presented in Section 5. Section 6 compares
the respective methods, which leads to conclusions in Section 7.

2 Preliminaries 



Throughout this paper we use standard formal language notation (see e.g. Harrison
(1978)). In this section we recall some basic definitions.



A context-free grammar $G$ is a 4-tuple $(\Sigma, N, P, S)$, where $\Sigma$ and $N$ are two
finite disjoint sets of terminals and nonterminals, respectively, $S \in N$ is the start
symbol, and $P$ is a finite set of rules. Each rule has the form $A \to \alpha$ with $A \in N$
and $\alpha \in V^*$, where $V$ denotes $N \cup \Sigma$. The relation $\to$ on $N \times V^*$ is
extended to a relation on $V^* \times V^*$ as usual. The transitive and reflexive closure of
$\to$ is denoted by $\to^*$.

The language generated by $G$ is given by the set $\{w \in \Sigma^* \mid S \to^* w\}$. By
definition, such a set is a context-free language. By reduction of a grammar we mean the
elimination from $P$ of all rules $A \to \gamma$ such that
$S \to^* \alpha A \beta \to \alpha \gamma \beta \to^* w$ does not hold
for any $\alpha, \beta \in V^*$ and $w \in \Sigma^*$.
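The reduction just defined can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the experiments: the function name `reduce_grammar`, the encoding of rules as `(lhs, rhs)` pairs with `rhs` a tuple of symbols, and the example grammar are all assumptions made for this sketch. It first discards rules containing nonterminals that derive no terminal string, and then discards rules whose left-hand side cannot be reached from the start symbol.

```python
def reduce_grammar(rules, start, terminals):
    """Keep only the rules that occur in some derivation S ->* w, w terminal.

    rules: list of (lhs, rhs) pairs, where rhs is a tuple of symbols.
    """
    def usable(rhs, productive):
        # every symbol is a terminal or a nonterminal known to derive a terminal string
        return all(x in terminals or x in productive for x in rhs)

    # Step 1: nonterminals that derive at least one terminal string.
    productive = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in productive and usable(rhs, productive):
                productive.add(lhs)
                changed = True
    candidates = [(l, r) for l, r in rules if usable(r, productive)]

    # Step 2: nonterminals reachable from the start symbol via the remaining rules.
    reachable = {start}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in candidates:
            if lhs in reachable:
                for x in rhs:
                    if x not in terminals and x not in reachable:
                        reachable.add(x)
                        changed = True
    return [(l, r) for l, r in candidates if l in reachable]


# Hypothetical grammar: A derives no terminal string, B is unreachable from S.
RULES = [
    ("S", ("a", "S", "a")),
    ("S", ()),
    ("S", ("A",)),
    ("A", ("A", "b")),
    ("B", ("b",)),
]
REDUCED = reduce_grammar(RULES, "S", {"a", "b"})
```

The order of the two steps matters: reachability is computed only over rules whose symbols all derive terminal strings, so that every surviving rule can be completed to a derivation of some $w \in \Sigma^*$.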

We generally use symbols $A, B, C, \ldots$ to range over $N$, symbols $a, b, c, \ldots$ to range
over $\Sigma$, symbols $X, Y, Z$ to range over $V$, symbols $\alpha, \beta, \gamma, \ldots$ to range
over $V^*$ and symbols $v, w, x, \ldots$ to range over $\Sigma^*$. We write $\epsilon$ to denote the
empty string.

A rule of the form $A \to B$ is called a unit rule.

A (nondeterministic) finite automaton $\mathcal{F}$ is a 5-tuple $(K, \Sigma, \Delta, s, F)$,
where $K$ is a finite set of states, of which $s$ is the initial state and those in
$F \subseteq K$ are the final states, $\Sigma$ is the input alphabet, and the transition
relation $\Delta$ is a finite subset of $K \times \Sigma^* \times K$.

We define a configuration to be an element of $K \times \Sigma^*$. We define the binary
relation $\vdash$ between configurations as: $(q, vw) \vdash (q', w)$ if and only if
$(q, v, q') \in \Delta$. The transitive and reflexive closure of $\vdash$ is denoted by
$\vdash^*$.

Some input $v$ is recognized if $(s, v) \vdash^* (q, \epsilon)$, for some $q \in F$. The
language accepted by $\mathcal{F}$ is defined to be the set of all strings $v$ that are
recognized. By definition, a language accepted by a finite automaton is called a regular
language.
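To make the definition of recognition concrete, the following Python sketch searches the space of configurations directly. It is only an illustration under assumptions of our own: the name `accepts` is hypothetical, states are arbitrary hashable values, and a configuration $(q, w)$ is stored as a pair of a state and a position in the input, which is equivalent since $w$ is always a suffix of the input. Transition labels are strings over $\Sigma$, so the empty label plays the role of $\epsilon$.

```python
def accepts(delta, start, finals, word):
    """Decide whether (start, word) |-* (q, epsilon) for some final state q.

    delta: set of (q, label, q2) triples; label is a (possibly empty) string.
    """
    seen = {(start, 0)}
    todo = [(start, 0)]
    while todo:
        q, pos = todo.pop()
        if pos == len(word) and q in finals:
            return True
        for p, label, p2 in delta:
            # one step of the |- relation: consume `label` at position `pos`
            if p == q and word.startswith(label, pos):
                conf = (p2, pos + len(label))
                if conf not in seen:
                    seen.add(conf)
                    todo.append(conf)
    return False


# Hypothetical automaton for (ab)*c, with one transition labelled by a two-symbol string.
DELTA = {(0, "ab", 0), (0, "c", 1)}
```

Because visited configurations are recorded, cycles of $\epsilon$-transitions do not cause non-termination.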

3 Finite Automata in Absence of Self-embedding 

We define a spine in a parse tree to be a path that runs from the root down to some leaf. 
Our main interest in spines lies in the sequences of grammar symbols at nodes bordering 
on spines. 

A simple example is the set of parse trees such as the one in Figure 1, for a grammar of
palindromes. It is intuitively clear that the language is not regular: the grammar symbols
to the left of the spine from the root to $\epsilon$ "communicate" with those to the right of
the spine. More precisely, the prefix of the input up to the point where it meets the final
node $\epsilon$ of the spine determines the suffix after that point, in a way that an unbounded
quantity of symbols from the prefix need to be taken into account.

A formal explanation for why the grammar may not generate a regular language
relies on the following definition (Chomsky, 1959b):



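Definition
A grammar is self-embedding if there is some $A \in N$ such that
$A \to^* \alpha A \beta$, for some $\alpha \neq \epsilon$ and $\beta \neq \epsilon$.

[Figure 1: the grammar $S \to a\,S\,a$, $S \to b\,S\,b$, $S \to \epsilon$, and a parse tree in
which an application of $S \to a\,S\,a$ dominates an application of $S \to b\,S\,b$.]

Figure 1

Grammar of palindromes, and a parse tree.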

If a grammar is not self-embedding, this means that when a section of a spine in a parse 
tree repeats itself, then either no grammar symbols occur to the left of that section of the 
spine, or no grammar symbols occur to the right. This prevents the "unbounded com- 
munication" between the two sides of the spine exemplified by the palindrome grammar. 
We now prove that grammars that are not self-embedding generate regular languages. 
For an arbitrary grammar, we define the set of recursive nonterminals as: 



$$\overline{N} = \{A \in N \mid \exists \alpha, \beta\, [A \to^+ \alpha A \beta]\}$$



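We determine the partition $\mathcal{N}$ of $\overline{N}$ consisting of subsets
$N_1, N_2, \ldots, N_k$, for some $k \geq 0$, of mutually recursive nonterminals:

$$\mathcal{N} = \{N_1, N_2, \ldots, N_k\}$$
$$N_1 \cup N_2 \cup \cdots \cup N_k = \overline{N}$$
$$\forall i\, [N_i \neq \emptyset] \text{ and } \forall i, j\, [i \neq j \Rightarrow N_i \cap N_j = \emptyset]$$
$$\exists i\, [A \in N_i \wedge B \in N_i] \Leftrightarrow \exists \alpha_1, \beta_1, \alpha_2, \beta_2\, [A \to^* \alpha_1 B \beta_1 \wedge B \to^* \alpha_2 A \beta_2], \text{ for all } A, B \in \overline{N}$$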

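We now define the function recursive from $\mathcal{N}$ to the set {left, right, self, cyclic}.
For $1 \leq i \leq k$:

$$
\mathit{recursive}(N_i) =
\begin{cases}
\mathit{left}, & \text{if } \neg\mathit{LeftGenerating}(N_i) \wedge \mathit{RightGenerating}(N_i) \\
\mathit{right}, & \text{if } \mathit{LeftGenerating}(N_i) \wedge \neg\mathit{RightGenerating}(N_i) \\
\mathit{self}, & \text{if } \mathit{LeftGenerating}(N_i) \wedge \mathit{RightGenerating}(N_i) \\
\mathit{cyclic}, & \text{if } \neg\mathit{LeftGenerating}(N_i) \wedge \neg\mathit{RightGenerating}(N_i)
\end{cases}
$$

where

$$\mathit{LeftGenerating}(N_i) = \exists (A \to \alpha B \beta) \in P\, [A \in N_i \wedge B \in N_i \wedge \alpha \neq \epsilon]$$
$$\mathit{RightGenerating}(N_i) = \exists (A \to \alpha B \beta) \in P\, [A \in N_i \wedge B \in N_i \wedge \beta \neq \epsilon]$$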



When recursive($N_i$) = left, $N_i$ consists of only left-recursive nonterminals, which does
not mean it cannot also contain right-recursive nonterminals, but in that case right
recursion amounts to application of unit rules. When recursive($N_i$) = cyclic, it is only
such unit rules that take part in the recursion.
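The analysis above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `classify`, the rule encoding as `(lhs, rhs)` pairs with `rhs` a tuple of symbols, and the two example grammars are assumptions of this sketch. It computes which nonterminals can be reached from which (the relation $A \to^+ \alpha B \beta$), groups the recursive nonterminals into sets of mutually recursive ones, and then evaluates LeftGenerating and RightGenerating for each set.

```python
def classify(rules, nonterminals):
    """Map each set N_i of mutually recursive nonterminals to left/right/self/cyclic."""
    # reach[A] = nonterminals B such that A ->+ alpha B beta
    reach = {a: set() for a in nonterminals}
    for lhs, rhs in rules:
        reach[lhs].update(x for x in rhs if x in nonterminals)
    changed = True
    while changed:
        changed = False
        for a in nonterminals:
            new = set().union(*(reach[b] for b in reach[a])) - reach[a]
            if new:
                reach[a] |= new
                changed = True

    recursive_nts = {a for a in nonterminals if a in reach[a]}
    partition = []
    for a in sorted(recursive_nts):
        if not any(a in part for part in partition):
            partition.append(frozenset(
                b for b in recursive_nts if b in reach[a] and a in reach[b]))

    result = {}
    for part in partition:
        left_gen = right_gen = False
        for lhs, rhs in rules:
            if lhs in part:
                for k, x in enumerate(rhs):
                    if x in part:
                        left_gen |= k > 0              # alpha is non-empty
                        right_gen |= k < len(rhs) - 1  # beta is non-empty
        if left_gen and right_gen:
            result[part] = "self"
        elif left_gen:
            result[part] = "right"
        elif right_gen:
            result[part] = "left"
        else:
            result[part] = "cyclic"
    return result


# Hypothetical example grammars.
PALINDROMES = [("S", ("a", "S", "a")), ("S", ("b", "S", "b")), ("S", ())]
SMALL = [("S", ("A", "a")), ("A", ("S", "B")), ("A", ("B", "b")),
         ("B", ("B", "c")), ("B", ("d",))]
```

For the palindrome grammar the single set $\{S\}$ is classified as self, whereas for the second grammar both $\{S, A\}$ and $\{B\}$ are classified as left. A production-quality version would use a linear-time algorithm for strongly connected components instead of the naive closure used here.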

That recursive($N_i$) = self, for some $i$, is a sufficient and necessary condition for
the grammar to be self-embedding. Therefore, we have to prove that if recursive($N_i$) $\in$
{left, right, cyclic}, for all $i$, then the grammar generates a regular language. Our proof
differs from an existing proof (Chomsky, 1959a) in that it is fully constructive: Figure 2
presents an algorithm for creating a finite automaton that accepts the language generated
by the grammar.

The process is initiated at the start symbol, and from there the process descends the 
grammar in all ways until terminals are encountered, and then transitions are created 
labelled with those terminals. Descending the grammar is straightforward in the case of 
rules of which the left-hand side is not a recursive nonterminal: the subautomata found 
recursively for members in the right-hand side will be connected. In the case of recursive 
nonterminals, the process depends on whether the nonterminals in the corresponding 
set from $\mathcal{N}$ are mutually left-recursive or right-recursive; if they are both, which means
they are cyclic, then either subprocess can be applied; in the code in Figure 2, cyclic and
right-recursive subsets $N_i$ are treated uniformly.

We discuss the case that the nonterminals are left-recursive. One new state is created
for each nonterminal in the set. The transitions that are created for terminals and
nonterminals not in $N_i$ are connected in a way that is reminiscent of the construction of
left-corner parsers (Rosenkrantz and Lewis II, 1970), and specifically of one construction
that focuses on sets of mutually recursive nonterminals (Nederhof, 1994, Section 5).

An example is given in Figure 3. Four states have been labelled according to the
names they are given in procedure make_fa. There are two states that are labelled $q_B$.
This can be explained by the fact that nonterminal $B$ can be reached by descending the
grammar from $S$ in two essentially distinct ways.

The code in Figure 2 differs from the actual implementation in that sometimes for
a nonterminal a separate finite automaton is constructed, viz. for those nonterminals
that occur as $A$ in the code. A transition in such a subautomaton may be labelled by
another nonterminal $B$, which then represents the subautomaton corresponding to $B$.
The resulting representation is similar to extended context-free grammars (Purdom and
Brown, 1981), with the exception that in our case recursion cannot occur, by virtue of
the construction.

The representation for the running example is indicated by Figure 4, which shows
two subautomata, labelled $S$ and $B$. The one labelled $S$ is the automaton on the top level,
and contains two transitions labelled $B$, which refer to the other subautomaton. Note
that this representation is more compact than the one from Figure 3, since the transitions
that are involved in representing the sublanguage of strings generated by nonterminal $B$
are included only once.

The compact representation consisting of subautomata can be turned into a single
finite automaton by substituting subautomata $A$ for transitions labelled $A$ in other
automata. This comes down to regular substitution in the sense of Berstel (1979). The
advantage of this way of obtaining a finite automaton over a direct construction of a
nondeterministic automaton is that subautomata may be determinized and minimized
before they are substituted into larger subautomata. Since in many cases determinized
and minimized automata are much smaller, this process avoids much of the combinatorial
explosion that takes place upon naive construction of a single nondeterministic finite
automaton.$^1$

Assume we have a list of subautomata $A_1, \ldots, A_m$ that is ordered from lower level to
higher level automata; i.e. if an automaton $A_p$ occurs as label of a transition of automaton



1 The representation by Mohri and Pereira (1998) is even more compact than ours for grammars that
are not self-embedding. However, in the sequel we are going to use our representation as
intermediate result in approximating an unrestricted context-free grammar, with the final objective
of obtaining a single minimal deterministic automaton. For this purpose, the representation by
Mohri and Pereira (1998) offers little advantage.
let $K = \emptyset$, $\Delta = \emptyset$, $s$ = fresh_state, $f$ = fresh_state, $F = \{f\}$;
make_fa($s$, $S$, $f$).

procedure make_fa($q_0$, $\alpha$, $q_1$):
  if $\alpha = \epsilon$
  then let $\Delta = \Delta \cup \{(q_0, \epsilon, q_1)\}$
  elseif $\alpha = a$, some $a \in \Sigma$
  then let $\Delta = \Delta \cup \{(q_0, a, q_1)\}$
  elseif $\alpha = X\beta$, some $X \in V$, $\beta \in V^*$ such that $|\beta| > 0$
  then let $q$ = fresh_state;
       make_fa($q_0$, $X$, $q$);
       make_fa($q$, $\beta$, $q_1$)
  else let $A = \alpha$; (* $\alpha$ must consist of a single nonterminal *)
       if there exists $i$ such that $A \in N_i$
       then for each $B \in N_i$ do let $q_B$ = fresh_state end;
            if recursive($N_i$) = left
            then for each $(C \to X_1 \cdots X_m) \in P$ such that $C \in N_i \wedge X_1, \ldots, X_m \notin N_i$
                 do make_fa($q_0$, $X_1 \cdots X_m$, $q_C$)
                 end;
                 for each $(C \to D X_1 \cdots X_m) \in P$ such that
                          $C, D \in N_i \wedge X_1, \ldots, X_m \notin N_i$
                 do make_fa($q_D$, $X_1 \cdots X_m$, $q_C$)
                 end;
                 let $\Delta = \Delta \cup \{(q_A, \epsilon, q_1)\}$
            else (* recursive($N_i$) $\in$ {right, cyclic} *)
                 for each $(C \to X_1 \cdots X_m) \in P$ such that $C \in N_i \wedge X_1, \ldots, X_m \notin N_i$
                 do make_fa($q_C$, $X_1 \cdots X_m$, $q_1$)
                 end;
                 for each $(C \to X_1 \cdots X_m D) \in P$ such that
                          $C, D \in N_i \wedge X_1, \ldots, X_m \notin N_i$
                 do make_fa($q_C$, $X_1 \cdots X_m$, $q_D$)
                 end;
                 let $\Delta = \Delta \cup \{(q_0, \epsilon, q_A)\}$
            end
       else for each $(A \to \beta) \in P$ do make_fa($q_0$, $\beta$, $q_1$) end (* $A$ is not recursive *)
       end
  end
end.

procedure fresh_state():
  create some object $q$ such that $q \notin K$;
  let $K = K \cup \{q\}$;
  return $q$
end.

Figure 2

Transformation from a grammar $G = (\Sigma, N, P, S)$ that is not self-embedding into an
equivalent finite automaton $\mathcal{F} = (K, \Sigma, \Delta, s, F)$.
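For readers who want to experiment, the procedure of Figure 2 can be transcribed into Python roughly as follows. This is a sketch under assumptions of our own, not the implementation used in the experiments: the names `build_fa` and `accepts`, the encoding of rules as `(lhs, rhs)` pairs with tuples of single-character symbols, integer states, and the example grammar are all hypothetical. The partition $\mathcal{N}$ and the function recursive are passed in precomputed, and no set $N_i$ may be classified as self.

```python
import itertools


def build_fa(rules, start, terminals, partition, recursive):
    """partition: list of frozensets N_i; recursive: dict N_i -> left/right/cyclic."""
    counter = itertools.count()
    delta = set()

    def fresh_state():
        return next(counter)

    def make_fa(q0, alpha, q1):
        if len(alpha) == 0:
            delta.add((q0, "", q1))
        elif len(alpha) > 1:
            q = fresh_state()
            make_fa(q0, alpha[:1], q)
            make_fa(q, alpha[1:], q1)
        elif alpha[0] in terminals:
            delta.add((q0, alpha[0], q1))
        else:
            a = alpha[0]
            ni = next((p for p in partition if a in p), None)
            if ni is None:                      # a is not recursive
                for lhs, rhs in rules:
                    if lhs == a:
                        make_fa(q0, rhs, q1)
                return
            q = {b: fresh_state() for b in ni}
            left = recursive[ni] == "left"
            for c, rhs in rules:
                if c not in ni:
                    continue
                has_member = any(x in ni for x in rhs)
                if left and not has_member:
                    make_fa(q0, rhs, q[c])
                elif left:                      # C -> D X1...Xm, D in N_i
                    make_fa(q[rhs[0]], rhs[1:], q[c])
                elif not has_member:
                    make_fa(q[c], rhs, q1)
                else:                           # C -> X1...Xm D, D in N_i
                    make_fa(q[c], rhs[:-1], q[rhs[-1]])
            delta.add((q[a], "", q1) if left else (q0, "", q[a]))

    s, f = fresh_state(), fresh_state()
    make_fa(s, (start,), f)
    return delta, s, {f}


def accepts(delta, start, finals, word):
    """Search over configurations (state, input position)."""
    seen, todo = {(start, 0)}, [(start, 0)]
    while todo:
        q, pos = todo.pop()
        if pos == len(word) and q in finals:
            return True
        for p, label, p2 in delta:
            nxt = (p2, pos + len(label))
            if p == q and word.startswith(label, pos) and nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return False


# Hypothetical grammar that is not self-embedding; it generates d c* b a (d c* a)*.
RULES = [("S", ("A", "a")), ("A", ("S", "B")), ("A", ("B", "b")),
         ("B", ("B", "c")), ("B", ("d",))]
N1, N2 = frozenset({"S", "A"}), frozenset({"B"})
FA = build_fa(RULES, "S", set("abcd"), [N1, N2], {N1: "left", N2: "left"})
```

The sketch relies on the property that, in a set classified as left, a member of $N_i$ can only occur as the first symbol of a right-hand side, and in a set classified as right or cyclic only as the last symbol; this is what justifies the use of `rhs[0]` and `rhs[-1]`.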
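[Figure 3: grammar and automaton diagram not recoverable from the extraction. The
accompanying analysis reads: $\overline{N} = \{S, A, B\}$, $\mathcal{N} = \{N_1, N_2\}$,
$N_1 = \{S, A\}$ with recursive($N_1$) = left, and $N_2 = \{B\}$ with recursive($N_2$) = left.]

Figure 3

Application of the code from Figure 2 on a small grammar.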






Figure 4

The automaton from Figure 3 in a compact representation.



$A_q$, then $p < q$; $A_m$ must be the start symbol $S$. This order is a natural result of the way
that subautomata are constructed during our depth-first traversal of the grammar, which
is actually postorder in the sense that a subautomaton is output after all subautomata
occurring at its transitions have been output.

Our implementation constructs a minimal deterministic automaton by repeating the
following for $p = 1, \ldots, m$:

1. Make a copy of $A_p$. Determinize and minimize the copy. If it has fewer
   transitions labelled by nonterminals than the original, then replace $A_p$ by its
   copy.

2. Replace each transition in $A_p$ of the form $(q, A_r, q')$ by (a copy of) automaton
   $A_r$ in a straightforward way. This means that new $\epsilon$-transitions connect $q$ to
   the start state of $A_r$ and the final states of $A_r$ to $q'$.

3. Again determinize and minimize $A_p$ and store it for later reference.

The automaton obtained for $A_m$ after step 3 is the desired result.
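Step 2 of this procedure, the substitution of a subautomaton for a transition, might look as follows in Python. The sketch is illustrative only: the function name `substitute`, the representation of an automaton as a triple `(delta, start, finals)`, and the example automata are assumptions, and determinization and minimization (steps 1 and 3) are deliberately left out.

```python
import itertools


def substitute(outer, name, inner):
    """Replace every transition (q, name, q2) of `outer` by a fresh copy of `inner`."""
    delta, start, finals = outer
    in_delta, in_start, in_finals = inner
    copy_id = itertools.count()
    result = set()
    for q, label, q2 in delta:
        if label != name:
            result.add((q, label, q2))
            continue
        tag = (name, next(copy_id))          # makes the states of this copy unique
        result.update(((tag, p), l, (tag, p2)) for p, l, p2 in in_delta)
        result.add((q, "", (tag, in_start)))                  # enter the copy
        result.update(((tag, p), "", q2) for p in in_finals)  # leave the copy
    return result, start, finals


# Hypothetical subautomata: S refers twice to B, and B accepts d c*.
OUTER = ({(0, "B", 1), (1, "a", 2), (2, "B", 3)}, 0, {3})
INNER = ({(0, "d", 1), (1, "c", 1)}, 0, {1})
FLAT = substitute(OUTER, "B", INNER)
```

Each occurrence of the label receives its own copy of the subautomaton; sharing a single copy between two occurrences would in general enlarge the language, since a computation could enter through one occurrence and leave through the other.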

4 Methods of Regular Approximation 

This section describes a number of methods for approximating a context-free grammar 
by means of a finite automaton. Some published methods did not mention self-embedding 



Nederhof Experiments with Regular Approximation 



explicitly as potential source of non-regularity of the language, and suggested that ap- 
proximations should be applied globally for the complete grammar. Where this is the 
case, we adapt the method so that it is more selective and deals with self-embedding 
locally. 

The approximations are integrated into the construction of the finite automaton from 
the grammar, which was described in the previous section. A separate incarnation of the 
approximation process is activated upon finding a nonterminal $A$ such that $A \in N_i$ and
recursive($N_i$) = self, for some $i$. This incarnation then only pertains to the set of rules
of the form $B \to \alpha$, where $B \in N_i$. In other words, nonterminals not in $N_i$ are treated
by this incarnation of the approximation process as if they were terminals.

4.1 Superset approximation based on RTNs 



The following approximation was proposed by Nederhof (1997). The presentation here
however differs substantially from the earlier publication, which treated the
approximation process entirely on the level of context-free grammars: a self-embedding
grammar was transformed in such a way that it was no longer self-embedding. A finite
automaton was then obtained from the grammar by the algorithm discussed above.



The presentation here is based on recursive transition networks (RTNs) (Woods,
1970). We can see a context-free grammar as an RTN as follows. We introduce two
states $q_A$ and $q'_A$ for each nonterminal $A$, and $m + 1$ states $q_0, \ldots, q_m$ for each rule
$A \to X_1 \cdots X_m$. The states for a rule $A \to X_1 \cdots X_m$ are connected with each other and
to the states for the left-hand side $A$ by one transition $(q_A, \epsilon, q_0)$, a transition
$(q_{i-1}, X_i, q_i)$ for each $i$ such that $1 \leq i \leq m$, and one transition
$(q_m, \epsilon, q'_A)$. (Actually, some epsilon transitions are avoided in our implementation,
but we will not be concerned with such optimizations here.)

In this way, we obtain a finite automaton with initial state $q_A$ and final state $q'_A$ for
each nonterminal $A$ and its defining rules $A \to X_1 \cdots X_m$. This automaton can be seen
as one component of the RTN. The complete RTN is obtained by the collection of all
such finite automata for different nonterminals.

An approximation now results if we join all the components in one big automaton,
and if we approximate the usual mechanism of recursion by replacing each transition
$(q, A, q')$ by two transitions $(q, \epsilon, q_A)$ and $(q'_A, \epsilon, q')$. The construction is
illustrated in Figure 5.
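The construction can be sketched compactly in Python. The following is an illustration under our own assumptions, not the implementation used in the experiments: the names `rtn_approximation` and `accepts` are hypothetical, the states $q_A$ and $q'_A$ are encoded as `("in", A)` and `("out", A)`, the states $q_0, \ldots, q_m$ of the rule with index `r` as `(r, 0)` to `(r, m)`, and the example grammar is chosen for this sketch. The sketch works on the complete grammar and performs none of the optimizations mentioned above.

```python
def rtn_approximation(rules, start, nonterminals):
    """Build the approximating automaton; rules are (lhs, rhs) with rhs a tuple."""
    delta = set()
    for r, (lhs, rhs) in enumerate(rules):
        delta.add((("in", lhs), "", (r, 0)))
        delta.add(((r, len(rhs)), "", ("out", lhs)))
        for i, x in enumerate(rhs, start=1):
            if x in nonterminals:
                # the transition (q_{i-1}, X_i, q_i) is replaced by two epsilon transitions
                delta.add(((r, i - 1), "", ("in", x)))
                delta.add((("out", x), "", (r, i)))
            else:
                delta.add(((r, i - 1), x, (r, i)))
    return delta, ("in", start), {("out", start)}


def accepts(delta, start, finals, word):
    """Search over configurations (state, input position)."""
    seen, todo = {(start, 0)}, [(start, 0)]
    while todo:
        q, pos = todo.pop()
        if pos == len(word) and q in finals:
            return True
        for p, label, p2 in delta:
            nxt = (p2, pos + len(label))
            if p == q and word.startswith(label, pos) and nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return False


# Hypothetical self-embedding grammar with start symbol A.
RULES = [("A", ("a", "B", "b")), ("A", ("c", "A")),
         ("B", ("d", "A", "e")), ("B", ("f",))]
FA = rtn_approximation(RULES, "A", {"A", "B"})
```

For this grammar the automaton accepts the sentence `adafbeb` of the context-free language, but also the string `adafb`, which the grammar does not generate: after leaving the inner occurrence of $A$ the automaton may continue as if it were in the outermost one. This is the sense in which the result is a superset approximation.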

In terms of the original grammar, this approximation can be informally explained
as follows. Suppose we have three rules $B \to \alpha A \beta$, $B' \to \alpha' A \beta'$, and
$A \to \gamma$. Top-down left-to-right parsing would proceed for example by recognizing
$\alpha$ in the first rule; it would then descend into rule $A \to \gamma$, and recognize $\gamma$;
it would then return to the first rule and subsequently process $\beta$. In the approximation
however, the finite automaton "forgets" which rule it came from when it starts to recognize
$\gamma$, so that it may subsequently recognize $\beta'$ in the second rule.

For the sake of presentational convenience, the above describes a construction working
on the complete grammar. However, our implementation applies the construction
separately for each nonterminal in a set $N_i$ such that recursive($N_i$) = self, which leads
to a separate subautomaton of the compact representation (Section 3).

See Nederhof (1998) for a variant of this approximation that constructs finite
transducers rather than finite automata.

We have further implemented a parameterized version of the RTN approximation.
A state of the nondeterministic automaton is now also associated to a list $H$ of length
$|H|$ strictly smaller than a number $d$, which is the parameter to the method. This list
represents a history of rule positions that were encountered in the computation leading
to the present state.
[Figure 5: diagrams not recoverable from the extraction. Panel (a) lists the grammar, of
which the rules $A \to a\,B\,b$, $A \to c\,A$ and $B \to d\,A\,e$ are legible; panel (b) shows the
RTN; panel (c) shows the approximating finite automaton.]

Figure 5

Application of the RTN method for the grammar at (a). The RTN is given at (b), and (c)
presents the approximating finite automaton. We assume $A$ is the start symbol and therefore
$q_A$ becomes initial state and $q'_A$ becomes final state in the approximating automaton.



More precisely, we define an item to be an object of the form $[A \to \alpha \bullet \beta]$,
where $A \to \alpha\beta$ is a rule from the grammar. These are the same objects as the "dotted"
productions from Earley (1970). The dot indicates a position in the right-hand side.

The unparameterized RTN method had one state $q_I$ for each item $I$, and two states
$q_A$ and $q'_A$ for each nonterminal $A$. The parameterized RTN method has one state $q_{IH}$
for each item $I$ and each list of items $H$ that represents a valid history for reaching
$I$, and two states $q_{AH}$ and $q'_{AH}$ for each nonterminal $A$ and each list of items $H$ that
represents a valid history for reaching $A$. Such a valid history is defined to be a list $H$
with $0 \leq |H| < d$ that represents a series of positions in rules that could have been
invoked before reaching $I$ or $A$, respectively. More precisely, if we set
$H = I_1 \cdots I_n$, then each $I_m$ ($1 \leq m \leq n$) should be of the form
$[A_m \to \alpha_m \bullet B_m \beta_m]$ and for $1 \leq m < n$ we should have $A_m = B_{m+1}$.
Furthermore, for a state $q_{IH}$ with $I = [A \to \alpha \bullet \beta]$ we demand
$A = B_1$ if $n > 0$. For a state $q_{AH}$ we demand $A = B_1$ if $n > 0$. (Strictly speaking,
states $q_{AH}$ and $q_{IH}$, with $|H| < d - 1$ and $I = [A \to \alpha \bullet \beta]$, will only be
needed if $A_{|H|}$ is the start symbol in the case $|H| > 0$, or if $A$ is the start symbol in
the case $H = \epsilon$.)

The transitions of the automaton that pertain to terminals in right-hand sides of rules
are very similar to those in the case of the unparameterized method: For a state $q_{IH}$ with
$I$ of the form $[A \to \alpha \bullet a\beta]$, we create a transition $(q_{IH}, a, q_{I'H})$, with
$I' = [A \to \alpha a \bullet \beta]$.

Similarly, we create epsilon transitions that connect left-hand sides and right-hand
sides of rules: For each state $q_{AH}$ there is a transition $(q_{AH}, \epsilon, q_{IH})$ for each
item $I = [A \to \bullet\, \alpha]$, for some $\alpha$, and for each state of the form $q_{I'H}$, with
$I' = [A \to \alpha\, \bullet]$, there is a transition $(q_{I'H}, \epsilon, q'_{AH})$.

For transitions that pertain to nonterminals in the right-hand sides of rules, we need
to manipulate the histories. For a state $q_{IH}$ with $I$ of the form
$[A \to \alpha \bullet B\beta]$, we create two epsilon transitions. One is
$(q_{IH}, \epsilon, q_{BH'})$, where $H'$ is defined to be $IH$ if $|IH| < d$, and
to be the first $d - 1$ items of $IH$ otherwise. Informally, we extend the history by the item
$I$ representing the rule position that we have just come from, but the oldest information
in the history is discarded if the history becomes too long. The second transition is
$(q'_{BH'}, \epsilon, q_{I'H})$, with $I' = [A \to \alpha B \bullet \beta]$.

If the start symbol is $S$, the initial state is $q_S$ and the final state is $q'_S$ (after the
symbol $S$ in the subscripts we find an empty list of items). Note that the parameterized
method with $d = 1$ concurs with the unparameterized method, since the lists of items
then remain empty.

An example with parameter $d = 2$ is given in Figure 6. For the unparameterized
method, each $I = [A \to \alpha \bullet \beta]$ corresponded to one state (Figure 5). Since
reaching $A$ can have three different histories of length shorter than 2 (the empty history,
since $A$ is start symbol; the history of coming from the rule position given by item
$[A \to c \bullet A]$; and the history of coming from the rule position given by item
$[B \to d \bullet A\,e]$), in Figure 6 we now have three states of the form $q_{IH}$ for each
$I = [A \to \alpha \bullet \beta]$, as well as three states of the form $q_{AH}$ and $q'_{AH}$.

The higher we choose d, the more precise the approximation is, since the histories 
allow the automaton to simulate part of the mechanism of recursion from the original 
grammar, and the maximum length of the histories corresponds to the number of levels 
of recursion that can be simulated accurately. 
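The manipulation of histories is easy to get wrong by one position, so a small sketch may help. The Python below is illustrative only: the names `extend_history` and `reachable_histories`, the encoding of an item $[A \to \alpha \bullet B\beta]$ as a triple `(lhs, rhs, dot)`, and the example grammar are assumptions of this sketch. The first function computes $H'$ from $I$ and $H$ as defined above; the second enumerates the pairs of a nonterminal and a history that can arise when descending from the start symbol, which we take as an approximation of the set of states $q_{AH}$ that are actually needed.

```python
def extend_history(item, history, d):
    """Return H' = IH if |IH| < d, and the first d - 1 items of IH otherwise."""
    extended = (item,) + history
    return extended if len(extended) < d else extended[:d - 1]


def reachable_histories(rules, start, nonterminals, d):
    """Collect all pairs (A, H) reachable from (start, empty history)."""
    seen = {(start, ())}
    todo = [(start, ())]
    while todo:
        a, history = todo.pop()
        for lhs, rhs in rules:
            if lhs != a:
                continue
            for dot, x in enumerate(rhs):
                if x in nonterminals:
                    item = (lhs, rhs, dot)   # the dot stands just before x
                    nxt = (x, extend_history(item, history, d))
                    if nxt not in seen:
                        seen.add(nxt)
                        todo.append(nxt)
    return seen


# Hypothetical grammar with start symbol A.
RULES = [("A", ("a", "B", "b")), ("A", ("c", "A")),
         ("B", ("d", "A", "e")), ("B", ("f",))]
PAIRS = reachable_histories(RULES, "A", {"A", "B"}, 2)
```

With $d = 2$ this yields three histories for $A$ (the empty one and the two single-item histories for the rule positions before $A$) and one history for $B$; with $d = 1$ every history stays empty, in line with the remark that the parameterized method then concurs with the unparameterized one.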

4.2 Refinement of RTN superset approximation 



We rephrase the method by Grimley Evans (1997) as follows. First, we construct the
approximating finite automaton according to the unparameterized RTN method above.
Then an additional mechanism is introduced that ensures for each rule
$A \to X_1 \cdots X_m$ separately that the list of visits to the states $q_0, \ldots, q_m$
satisfies some reasonable criteria: a visit to $q_i$, with $0 \leq i < m$, should be followed
by one to $q_{i+1}$ or $q_0$. The latter option amounts to a nested incarnation of the rule.
There is a complementary condition for what should precede a visit to $q_i$, with
$0 < i \leq m$.

Since only pairs of consecutive visits to states from the set $\{q_0, \ldots, q_m\}$ are
considered, finite-state techniques suffice to implement such conditions. This can be
realized by attaching histories to the states as in the case of the parameterized RTN method
above, but now each history is a set rather than a list, and can contain at most one
item $[A \to \alpha \bullet \beta]$ for each rule $A \to \alpha\beta$. As reported by Grimley Evans
(1997) and confirmed by our own experiments, the nondeterministic finite automata resulting
from this method may be quite large, even for small grammars. The explanation is that the
number of such histories is exponential in the number of rules.

We have refined the method with respect to the original publication by applying the
construction separately for each nonterminal in a set $N_i$ such that recursive($N_i$) = self.

4.3 Subset approximation by transforming the grammar 

Putting restrictions on spines is another way to obtain a regular language. Several meth- 
ods can be defined. The first method we present investigates spines in a very detailed 
way. It eliminates from the language only those sentences for which a subderivation is 
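[Figure 6: automaton diagram not recoverable from the extraction. It is drawn for the
grammar $A \to a\,B\,b$, $A \to c\,A$, $B \to d\,A\,e$, $B \to f$, with states $q_{AH}$ for the
histories $H = \epsilon$, $H = [A \to c \bullet A]$ and $H = [B \to d \bullet A\,e]$, and states
$q_{BH}$ for the history $H = [A \to a \bullet B\,b]$.]

Figure 6

Application of the parameterized RTN method with $d = 2$. We again assume $A$ is the start
symbol. States $q_{IH}$ have not been labelled in order to avoid cluttering the picture.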



required of the form $B \to^* \alpha B \beta$, for some $\alpha \neq \epsilon$ and
$\beta \neq \epsilon$. The motivation is that such sentences do not occur frequently in
practice, since these subderivations make it difficult for people to comprehend them
(Resnik, 1992). Their exclusion will therefore not lead to much loss of coverage of typical
sentences, especially for simple application domains.

We express the method in terms of a grammar transformation in Figure 7. The effect
of this transformation is that a nonterminal $A$ is tagged with a set of pairs $(B, Q)$, where
$B$ is a nonterminal occurring higher in the spine; for given $B$, at most one such pair
$(B, Q)$ can be contained in the set. The set $Q$ may contain the element $l$ to indicate that
something to the left of the part of the spine from $B$ to $A$ was generated. Similarly,
$r \in Q$ indicates that something to the right was generated. If $Q = \{l, r\}$, then we have
obtained a derivation $B \to^* \alpha A \beta$, for some $\alpha \neq \epsilon$ and
$\beta \neq \epsilon$, and further occurrences of $B$ below $A$ should be blocked in order to
avoid a derivation with self-embedding.

An example is given in Figure 8. The original grammar is implicit in the depicted
parse tree on the left, and contains at least the rules $S \to A\,a$, $A \to b\,B$, $B \to C$
and $C \to S$. This grammar is self-embedding, since we have a subderivation $S \to^* bSa$.
We explain how $F_B$ is obtained from $F_A$ in the rule $A^{F_A} \to b\,B^{F_B}$. We first construct
$F' = \{(S, \{r\}), (A, \emptyset)\}$ from $F_A = \{(S, \{r\})\}$ by adding $(A, \emptyset)$, since no
other pair of the form $(A, Q)$ was already present. To the left of the occurrence of $B$ in the
original rule
We are given a grammar $G = (\Sigma, N, P, S)$. The following is to be performed for each set
$N_i \in \mathcal{N}$ such that recursive($N_i$) = self.

1. For each $A \in N_i$ and each $F \in 2^{(N_i \times 2^{\{l, r\}})}$, add the following
   nonterminal to $N$.

   • $A^F$.

2. For each $A \in N_i$, add the following rule to $P$.

   • $A \to A^{\emptyset}$.

3. For each $(A \to \alpha_0 A_1 \alpha_1 A_2 \cdots \alpha_{m-1} A_m \alpha_m) \in P$ such that
   $A, A_1, \ldots, A_m \in N_i$ and no symbols from $\alpha_0, \ldots, \alpha_m$ are members of
   $N_i$, and each $F$ such that $(A, \{l, r\}) \notin F$, add the following rule to $P$.

   • $A^F \to \alpha_0 A_1^{F_1} \alpha_1 \cdots A_m^{F_m} \alpha_m$, where, for $1 \leq j \leq m$,

     - $F_j = \{(B, Q \cup Q_l^j \cup Q_r^j) \mid (B, Q) \in F'\}$;

     - $F' = F \cup \{(A, \emptyset)\}$ if $\neg\exists Q\, [(A, Q) \in F]$, and $F' = F$ otherwise;

     - $Q_l^j = \emptyset$ if $\alpha_0 A_1 \alpha_1 \cdots A_{j-1} \alpha_{j-1} = \epsilon$, and
       $Q_l^j = \{l\}$ otherwise;

     - $Q_r^j = \emptyset$ if $\alpha_j A_{j+1} \alpha_{j+1} \cdots A_m \alpha_m = \epsilon$, and
       $Q_r^j = \{r\}$ otherwise.

4. Remove from $P$ the old rules of the form $A \to \alpha$, where $A \in N_i$.

5. Reduce the grammar.

Figure 7

Subset approximation by transforming the grammar.



$A \to b\,B$ we find a non-empty string $b$. This means that we have to add $l$ to all second
components of pairs in $F'$, which gives us $F_B = \{(S, \{l, r\}), (A, \{l\})\}$.
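The computation of the tag of a daughter nonterminal in step 3 of Figure 7 can be replayed in Python. The sketch below is illustrative and rests on assumptions of our own: the function name `child_tags` is hypothetical, a tag set $F$ is encoded as a frozenset of pairs `(B, Q)` with `Q` a frozenset over `"l"` and `"r"`, and the two Boolean arguments state whether the material to the left and to the right of the daughter in the rule is non-empty.

```python
def child_tags(F, lhs, left_nonempty, right_nonempty):
    """Compute F_j from F for a daughter in a rule with left-hand side `lhs`."""
    # F' adds (lhs, {}) unless some pair for lhs is already present
    if any(b == lhs for b, _ in F):
        f_prime = F
    else:
        f_prime = F | {(lhs, frozenset())}
    extra = set()
    if left_nonempty:
        extra.add("l")       # Q_l^j = {l}
    if right_nonempty:
        extra.add("r")       # Q_r^j = {r}
    return frozenset((b, q | extra) for b, q in f_prime)


def blocked(F, nonterminal):
    """No rule is created for A^F when (A, {l, r}) is in F."""
    return (nonterminal, frozenset({"l", "r"})) in F


# Replaying the spine S -> A a, A -> b B, B -> C, C -> S of the running example.
F_A = child_tags(frozenset(), "S", False, True)
F_B = child_tags(F_A, "A", True, False)
F_C = child_tags(F_B, "B", False, False)
F_S = child_tags(F_C, "C", False, False)
```

Here `F_B` comes out as $\{(S, \{l, r\}), (A, \{l\})\}$, and `blocked(F_S, "S")` holds for the lower occurrence of $S$, so no rule of the transformed grammar expands it.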

In the transformed grammar, the lower occurrence of $S$ in the tree is tagged with the
set $\{(S, \{l, r\}), (A, \{l\}), (B, \emptyset), (C, \emptyset)\}$. The meaning is that higher up in the
spine, we will find the nonterminals $S$, $A$, $B$ and $C$. The pair $(A, \{l\})$ indicates that
since we saw $A$ on the spine, something to the left has been generated, viz. $b$. The pair
$(B, \emptyset)$ indicates that nothing either to the left or to the right has been generated
since we saw $B$. The pair $(S, \{l, r\})$ indicates that both to the left and to the right
something has been generated (viz. $b$ on the left and $a$ on the right). Since this indicates
that an offending subderivation $S \to^* \alpha S \beta$ has been found, further completion of
the parse tree is blocked: the transformed grammar will not have any rules with left-hand
side $S^{\{(S, \{l, r\}), (A, \{l\}), (B, \emptyset), (C, \emptyset)\}}$. In fact, after the grammar is
reduced, any parse tree that is constructed cannot even contain any longer a node labelled by
$S^{\{(S, \{l, r\}), (A, \{l\}), (B, \emptyset), (C, \emptyset)\}}$, or any nodes with labels of the form
$A^F$ such that $(A, \{l, r\}) \in F$.

One could generalize this approximation in such a way that not all self-embedding
is blocked, but only self-embedding occurring, say, twice in a row, in the sense of a
subderivation of the form $A \to^* \alpha_1 A \beta_1 \to^* \alpha_1 \alpha_2 A \beta_2 \beta_1$. We
will not do so here, because already for the basic case above, the transformed grammar can be
huge due to the high number of nonterminals of the form $A^F$ that may result; the number of
such nonterminals is exponential in the size of $N_i$.

We therefore present, in Figure 9, an alternative approximation that has a lower
[Figure 8: parse trees not recoverable from the extraction. In tree (b), the nonterminals
along the spine carry the tags
$F_A = \{(S, \{r\})\}$,
$F_B = \{(S, \{l, r\}), (A, \{l\})\}$,
$F_C = \{(S, \{l, r\}), (A, \{l\}), (B, \emptyset)\}$ and
$F_S = \{(S, \{l, r\}), (A, \{l\}), (B, \emptyset), (C, \emptyset)\}$.]

Figure 8

A parse tree in a self-embedding grammar (a), and the corresponding parse tree in the
transformed grammar (b), for the transformation from Figure 7. For the moment we ignore
step 5 of Figure 7, i.e. reduction of the transformed grammar.



complexity. By parameter $d$, it restricts the number of rules along a spine that may
generate something to the left and to the right. We do however not restrict pure left
recursion and pure right recursion. Between two occurrences of an arbitrary rule, we
allow left recursion followed by right recursion (which leads to tag $r$ followed by tag $rl$),
or right recursion followed by left recursion (which leads to tag $l$ followed by tag $lr$).

An example is given in Figure 10. As before, the rules of the grammar are implicit
in the depicted parse tree. At the top of the derivation we find $S$. In the transformed
grammar, we first have to apply $S \to S^{\top,0}$. The derivation starts with a rule
$S \to A\,a$, which generates a string (viz. $a$) to the right of a nonterminal (viz. $A$).
Before we can apply zero or more of such rules, we first have to apply a unit rule
$S^{\top,0} \to S^{r,0}$ in the transformed grammar. For zero or more rules that subsequently
generate something on the left, such as $A \to b\,B$, we have to obtain a superscript
containing $rl$, and in the example this is done by applying $A^{r,0} \to A^{rl,0}$. Now we are
finished with pure left recursion and pure right recursion, and apply
$B^{rl,0} \to B^{\bot,0}$. This allows us to apply one unconstrained rule, which appears in the
transformed grammar as $B^{\bot,0} \to c\,S^{\top,1}\,d$.

Now the counter $f$ has been increased from 0 at the start of the subderivation to
1 at the end. Depending on the value $d$ that we choose, we cannot build derivations by
repeating subderivation $S \to^* b\,c\,S\,d\,a$ an unlimited number of times: at some point the
counter will exceed $d$. If we choose $d = 0$, then already the derivation at Figure 10 (b) is
not possible anymore, since no nonterminal in the transformed grammar would contain
1 in its superscript.

Because of the demonstrated increase of the counter $f$, this transformation is guaranteed
to remove self-embedding from the grammar. However, it is not as selective as
the transformation we saw before, in the sense that it may also block subderivations that
are not of the form $A \to^* \alpha A \beta$. Consider for example the subderivation from Figure 10,
but replacing the lower occurrence of S by any other nonterminal C that is mutually
recursive with S, A and B. Such a subderivation $S \to^* b\,c\,C\,d\,a$ would also be blocked
by choosing d = 0. In general, increasing d allows more of such derivations that are not
of the form $A \to^* \alpha A \beta$ but also allows more derivations that are of that form.
We are given a grammar $G = (\Sigma, N, P, S)$. The following is to be performed for each set
$N_i \in \mathcal{N}$ such that recursive($N_i$) = self. The value d stands for the maximum number of
unconstrained rules along a spine, possibly alternated with a series of left-recursive rules
followed by a series of right-recursive rules, or vice versa.

1. For each $A \in N_i$, each $Q \in \{\top, l, r, lr, rl, \bot\}$, and each $f$ such that $0 \le f \le d$,
add the following nonterminal to $N$.

• $A^{Q,f}$.

2. For each $A \in N_i$, add the following rule to $P$.

• $A \to A^{\top,0}$.

3. For each $A \in N_i$ and $f$ such that $0 \le f \le d$, add the following rules to $P$.

• $A^{\top,f} \to A^{l,f}$.

• $A^{\top,f} \to A^{r,f}$.

• $A^{l,f} \to A^{lr,f}$.

• $A^{r,f} \to A^{rl,f}$.

• $A^{lr,f} \to A^{\bot,f}$.

• $A^{rl,f} \to A^{\bot,f}$.

4. For each $(A \to B\alpha) \in P$ such that $A, B \in N_i$ and no symbols from $\alpha$ are
members of $N_i$, each $f$ such that $0 \le f \le d$, and each $Q \in \{r, lr\}$, add the
following rule to $P$.

• $A^{Q,f} \to B^{Q,f} \alpha$.

5. For each $(A \to \alpha B) \in P$ such that $A, B \in N_i$ and no symbols from $\alpha$ are
members of $N_i$, each $f$ such that $0 \le f \le d$, and each $Q \in \{l, rl\}$, add the
following rule to $P$.

• $A^{Q,f} \to \alpha B^{Q,f}$.

6. For each $(A \to \alpha_0 A_1 \alpha_1 A_2 \cdots \alpha_{m-1} A_m \alpha_m) \in P$ such that $A, A_1, \ldots, A_m \in N_i$
and no symbols from $\alpha_0, \ldots, \alpha_m$ are members of $N_i$, and each $f$ such that
$0 \le f \le d$, add the following rule to $P$, provided $m = 0 \vee f < d$.

• $A^{\bot,f} \to \alpha_0 A_1^{\top,f+1} \alpha_1 \cdots A_m^{\top,f+1} \alpha_m$.

7. Remove from $P$ the old rules of the form $A \to \alpha$, where $A \in N_i$.

8. Reduce the grammar.

Figure 9

A simpler subset approximation by transforming the grammar.
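
Steps 1 to 7 of Figure 9 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the integrated implementation used in the experiments: rules are (lhs, rhs) pairs, a new nonterminal $A^{Q,f}$ is represented by the tuple (A, Q, f), the strings "top" and "bot" stand in for the tags ⊤ and ⊥, reduction (step 8) is omitted, and the rule S → e is a hypothetical addition so that the example recursion can terminate.

```python
TOP, BOT = "top", "bot"  # stand-ins for the tags written as top and bottom in Figure 9

# Unit rules of step 3, as (source tag, target tag) pairs.
TAG_STEPS = [(TOP, "l"), (TOP, "r"), ("l", "lr"), ("r", "rl"), ("lr", BOT), ("rl", BOT)]


def simple_subset_transform(rules, n_i, d):
    """Apply steps 1-7 of Figure 9 for one set n_i (step 8, reduction, is omitted).

    rules: list of (lhs, rhs) pairs, rhs being a tuple of symbols.
    A new nonterminal A^{Q,f} is represented as the tuple (A, Q, f).
    """
    # Step 7 removes only the old rules whose left-hand side is in n_i.
    out = [(lhs, rhs) for lhs, rhs in rules if lhs not in n_i]
    for a in sorted(n_i):
        out.append((a, ((a, TOP, 0),)))  # step 2
        for f in range(d + 1):
            for src, dst in TAG_STEPS:  # step 3
                out.append(((a, src, f), ((a, dst, f),)))
    for lhs, rhs in rules:
        if lhs not in n_i:
            continue
        inside = [k for k, x in enumerate(rhs) if x in n_i]
        for f in range(d + 1):
            if inside == [0]:  # step 4: A -> B alpha
                for q in ("r", "lr"):
                    out.append(((lhs, q, f), ((rhs[0], q, f),) + rhs[1:]))
            if inside == [len(rhs) - 1]:  # step 5: A -> alpha B
                for q in ("l", "rl"):
                    out.append(((lhs, q, f), rhs[:-1] + ((rhs[-1], q, f),)))
            if not inside or f < d:  # step 6, with its proviso m = 0 or f < d
                new_rhs = tuple((x, TOP, f + 1) if x in n_i else x for x in rhs)
                out.append(((lhs, BOT, f), new_rhs))
    return out


# Rules read off the example parse tree, plus the assumed terminating rule S -> e.
RULES = [("S", ("A", "a")), ("A", ("b", "B")), ("B", ("c", "S", "d")), ("S", ("e",))]
N_I = {"S", "A", "B"}
```

With d = 0 the unconstrained rule for B → c S d is not generated, which mirrors the observation that the derivation of the example is then blocked; with d = 1 it is generated for f = 0.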

The reason for considering this transformation rather than any other that eliminates
self-embedding is purely pragmatic: of the many variants we have tried that yield nontrivial
subset approximations, this transformation has the lowest complexity in terms of
the sizes of intermediate structures and of the resulting finite automata.

Figure 10

A parse tree in a self-embedding grammar (a), and the corresponding parse tree in the
transformed grammar (b), for the simple subset approximation from Figure 9.

In the actual implementation, we have integrated the grammar transformation and
the construction of the finite automaton, which avoids re-analysis of the grammar to
determine the partition of mutually recursive nonterminals after transformation. This
integration makes use for example of the fact that for fixed $N_i$ and fixed $f$, the set of
nonterminals of the form $A^{l,f}$, with $A \in N_i$, is (potentially) mutually right-recursive. A
set of such nonterminals can therefore be treated as the corresponding case from Figure ||, 
assuming the value right. 

The full formulation of the integrated grammar transformation and construction 
of the finite automaton is rather long and is therefore not given here. A very similar 
formulation, for another grammar transformation, is given by Nederhof (1998).



4.4 Superset approximation through pushdown automata 

The distinction between context-free languages and regular languages can be seen in
terms of the distinction between pushdown automata and finite automata. Pushdown
automata maintain a stack that is potentially unbounded in height, which allows more
complex languages to be recognized than in the case of finite automata. Regular approximation
can be achieved by restricting the height of the stack, as we will see in Section 4.5,
or by ignoring the distinction between several stacks when they become too high.



More specifically, the method proposed by Pereira and Wright (1997) first constructs
an LR automaton, which is a special case of a pushdown automaton. Then, stacks that
may be constructed in the course of recognition of a string are computed one by one.
However, stacks that contain two occurrences of a stack symbol are identified with the
shorter stack that results by removing the part of the stack between the two occurrences,
including one of the two occurrences. This process defines a congruence relation on stacks,
with a finite number of congruence classes. This congruence relation directly defines a
finite automaton: each class is translated to a unique state of the nondeterministic finite
automaton, shift actions are translated to transitions labelled with terminals, and reduce
actions are translated to epsilon transitions.
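
The identification of stacks can be illustrated by a small Python sketch. It is only an illustration of the collapsing step, not the published algorithm: the function name, the bottom-first list representation and the state names are assumptions, and so is the choice to process the stack from bottom to top when several symbols are repeated.

```python
def collapse(stack):
    """Map a stack (bottom first) to a representative without repeated symbols.

    Whenever a symbol is pushed that already occurs lower in the stack, the part
    above that earlier occurrence is dropped, so that only one of the two
    occurrences survives.
    """
    out = []
    for sym in stack:
        if sym in out:
            # Cut back to the earlier occurrence of sym.
            del out[out.index(sym) + 1:]
        else:
            out.append(sym)
    return out


# Two stacks that differ only by a loop between two occurrences of q3.
looping = ["q0", "q3", "q5", "q3", "q7"]
plain = ["q0", "q3", "q7"]
same_class = collapse(looping) == collapse(plain)
```

Since a representative never contains a symbol twice, its length is bounded by the number of stack symbols, which is why only finitely many classes arise.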

The method has a high complexity. First, construction of an LR automaton, of which
the size is exponential in the size of the grammar, may be a prohibitively expensive task
(Nederhof and Satta, 1996). This is however only a fraction of the effort needed to
compute the congruence classes, of which the number is in turn exponential in the size
of the LR automaton. If the resulting nondeterministic automaton is determinized, we
obtain a third source of exponential behaviour. The time and space complexity of the
method are thereby bounded by a triple exponential function in the size of the grammar.
This theoretical analysis seems to be in keeping with the high costs of applying this
method in practice, as will be shown later in this article.

As proposed by Pereira and Wright (1997), our implementation applies the approximation
separately for each nonterminal occurring in a set $N_i$ that reveals self-embedding.

A different superset approximation based on LR automata was proposed by Baker
(1981) and rediscovered by Heckert (1994). Each individual stack symbol is now translated
to one state of the nondeterministic finite automaton. It can be argued theoretically
that this approximation differs from the unparameterized RTN approximation from Section 4.1
only under certain conditions that are not likely to occur very often in practice.
This consideration is confirmed by our experiments to be discussed later. Our implementation
differs from the original algorithm in that the approximation is applied separately
for each nonterminal in a set $N_i$ that reveals self-embedding.



A generalization of this method was suggested by Bermudez and Schimpf (1990).
For a fixed number d > 0 we investigate sequences of d top-most elements of stacks that
may arise in the LR automaton, and we translate these to states of the finite automaton.
More precisely, we define another congruence relation on stacks, such that we have one
congruence class for each sequence of d stack symbols and this class contains all stacks
that have that sequence as d top-most elements; we have a separate class for each stack
that contains fewer than d elements. As before, each congruence class is translated to one
state of the nondeterministic finite automaton. Note that the case d = 1 is equivalent to
the approximation by Baker (1981).
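
The congruence on stacks described above amounts to a simple labelling function. The following Python sketch is only illustrative; the function name, the class labels "top" and "whole", and the state names are hypothetical.

```python
def top_d_class(stack, d):
    """Return a hashable label for the congruence class of a stack (bottom first).

    Stacks with at least d elements are identified by their d top-most elements;
    every shorter stack forms a class of its own. Requires d > 0.
    """
    if d <= 0:
        raise ValueError("d must be positive")
    if len(stack) < d:
        return ("whole", tuple(stack))
    return ("top", tuple(stack[-d:]))


# Two stacks that agree on their two top-most elements fall into one class for d = 2.
label_a = top_d_class(["q0", "q4", "q2"], 2)
label_b = top_d_class(["q9", "q7", "q4", "q2"], 2)
```

For d = 1 the label depends only on the top-most stack symbol, which corresponds to having one state per stack symbol.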

If we replace the LR automaton by a certain type of automaton that performs top-down
recognition, then the method by Bermudez and Schimpf (1990) amounts to the
parameterized RTN method from Section 4.1; note that the histories from Section 4.1 in
fact function as stacks, the items being the stack symbols.



4.5 Subset approximation through pushdown automata 

By restricting the height of the stack of a pushdown automaton, one obstructs recognition
of a set of strings in the context-free language, and therefore a subset approximation
results. This idea was proposed by Krauwer and des Tombe (1981), Langendoen and
Langsam (1987) and Pulman (1986), and was rediscovered by Black (1989) and recently
by Johnson (1998). Since the latest publication in this area is more explicit in its presentation,
we will base our treatment on this, instead of going to the historical roots of
the method.

One first constructs a modified left-corner recognizer from the grammar, in the form 
of a pushdown automaton. The stack height is bounded by a low number; Johnson (1998)
claims a suitable number would be 5. The motivation for using the left-corner strategy is
that this bound may not affect the language in case the grammar is not self-embedding, 
and thereby the approximation may be exact. The reason for this is that the height of the 
stack maintained by a left-corner parser is already bounded by a constant in the absence 
of self-embedding. 






Computational Linguistics Volume 00, Number 

Our own implementation is more refined than the published algorithms mentioned
above, in that it defines a separate left-corner recognizer for each nonterminal $A$ such
that $A \in N_i$ and recursive($N_i$) = self, for some $i$. In the construction for one such recognizer,
nonterminals that do not belong to $N_i$ are treated as terminals, as in all other methods
discussed here.

4.6 Superset approximation by N-grams



An approximation from Seyfarth and Bermudez (1995) can be explained as follows. Define
the set of all terminals reachable from nonterminal $A$ to be $\Sigma_A = \{a \mid \exists \alpha, \beta\,[A \to^* \alpha a \beta]\}$.
We now approximate the set of strings derivable from $A$ by $\Sigma_A^*$, which is the set of strings
consisting of terminals from $\Sigma_A$. Our implementation is slightly more sophisticated by
taking $\Sigma_A$ to be $\{X \mid \exists B, \alpha, \beta\,[B \in N_i \wedge B \to \alpha X \beta \wedge X \notin N_i]\}$, for each $A$ such that $A \in N_i$
and recursive($N_i$) = self, for some $i$. That is, each $X \in \Sigma_A$ is a terminal, or a nonterminal
not in the same set $N_i$ as $A$, but immediately reachable from set $N_i$, through $B \in N_i$.
This method can be generalized, inspired by Stolcke and Segal (1994), who derive N-gram
probabilities from stochastic context-free grammars. By ignoring the probabilities,
each N = 1, 2, 3, ... gives rise to a superset approximation that can be described as
follows. The set of strings derivable from a nonterminal $A$ is approximated by the set of
strings $a_1 \cdots a_n$ such that

• for each substring $v = a_{i+1} \cdots a_{i+N}$ ($0 \le i \le n - N$) we have $A \to^* w v y$, for
some $w$ and $y$,

• for each prefix $v = a_1 \cdots a_i$ ($0 \le i \le n$) such that $i < N$ we have $A \to^* v y$, for
some $y$, and

• for each suffix $v = a_{i+1} \cdots a_n$ ($0 \le i \le n$) such that $n - i < N$ we have
$A \to^* w v$, for some $w$.

(Again, the algorithms that we actually implemented are more refined and take into 
account the sets Ni.) 
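
The three conditions can be checked mechanically once the admissible substrings, prefixes and suffixes are known. The Python sketch below is a minimal illustration under a strong simplifying assumption: instead of deriving these sets from the grammar, it collects them from a finite list of sample strings that stands in for the strings derivable from A. All names are hypothetical, and strings are represented as Python strings with one character per terminal.

```python
def ngram_sets(samples, n):
    """Collect the length-n substrings and the prefixes and suffixes shorter than n."""
    infixes = {s[i:i + n] for s in samples for i in range(len(s) - n + 1)}
    prefixes = {s[:i] for s in samples for i in range(min(n, len(s) + 1))}
    suffixes = {s[len(s) - i:] for s in samples for i in range(min(n, len(s) + 1))}
    return infixes, prefixes, suffixes


def ngram_accepts(s, n, infixes, prefixes, suffixes):
    """Check the substring, prefix and suffix conditions for the string s."""
    length = len(s)
    ok_infix = all(s[i:i + n] in infixes for i in range(length - n + 1))
    ok_prefix = all(s[:i] in prefixes for i in range(min(n, length + 1)))
    ok_suffix = all(s[length - i:] in suffixes for i in range(min(n, length + 1)))
    return ok_infix and ok_prefix and ok_suffix


# Samples from the language of strings a^k b^k (k >= 1), used as a stand-in for A.
SAMPLES = ["ab", "aabb", "aaabbb"]
BIGRAM_SETS = ngram_sets(SAMPLES, 2)
```

For these samples the bigram check accepts every string of the form a^k b^k, but also unbalanced strings such as "aab", which illustrates why the result is a superset; "abab" is rejected because the substring "ba" never occurs.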



The approximation from Seyfarth and Bermudez (1995) can be seen as the case
N = 1, which will henceforth be called the unigram method. We have also experimented
with the cases N = 2 and N = 3, which will be called the bigram and trigram methods.
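
For the unigram method, the set of reachable terminals can be computed by a simple graph search over the grammar. The following Python sketch covers only the basic definition of the set of terminals reachable from A, not the refinement by the sets N_i; the rule representation and the convention that every symbol without a rule is a terminal are assumptions made for the illustration.

```python
def reachable_terminals(rules, start):
    """Return the terminals that occur in some sentential form derived from start.

    rules: list of (lhs, rhs) pairs, rhs being a tuple of symbols. Every symbol
    that never occurs as a left-hand side is taken to be a terminal.
    """
    nonterminals = {lhs for lhs, _ in rules}
    seen, todo, terminals = {start}, [start], set()
    while todo:
        current = todo.pop()
        for lhs, rhs in rules:
            if lhs != current:
                continue
            for sym in rhs:
                if sym not in nonterminals:
                    terminals.add(sym)
                elif sym not in seen:
                    seen.add(sym)
                    todo.append(sym)
    return terminals


# Palindrome grammar: S -> a S a | b S b | (empty).
PALINDROME = [("S", ("a", "S", "a")), ("S", ("b", "S", "b")), ("S", ())]
SIGMA_S = reachable_terminals(PALINDROME, "S")


def unigram_accepts(s, sigma):
    """A string is accepted if it consists only of reachable terminals."""
    return all(sym in sigma for sym in s)
```

For the palindrome grammar the unigram approximation accepts every string over a and b, which shows how coarse the case N = 1 can be.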

5 Increasing the Precision 

The methods of approximation described above take as input the parts of the grammar 
that pertain to self-embedding. It is only for those parts that the language is affected. 
This leads us to a way to increase the precision: before applying any of the above methods 
of regular approximation, we first transform the grammar. 

This grammar transformation copies grammar rules containing recursive nontermi- 
nals and, in the copies, it replaces these nonterminals by new non-recursive nonterminals. 
The new rules take over part of the roles of the old rules, but since the new rules do not 
contain recursion and therefore do not pertain to self-embedding, they remain unaffected 
by the approximation process. 

Consider for example the palindrome grammar from Figure 1. The RTN method will
yield a rather crude approximation, viz. the language $\{a, b\}^*$. We transform this grammar
in order to keep the approximation process away from the first three levels of recursion.
We achieve this by introducing three new nonterminals S[1], S[2] and S[3], and by adding
modified copies of the original grammar rules, so that we obtain:

S[1] → a S[2] a | b S[2] b | ε
S[2] → a S[3] a | b S[3] b | ε
S[3] → a S a | b S b | ε
S    → a S a | b S b | ε

The new start symbol is S[1].

The new grammar generates the same language as before, but the approximation
process leaves unaffected the nonterminals S[1], S[2] and S[3] and the rules defining
them, since these nonterminals are not recursive. These nonterminals amount to the
upper three levels of the parse trees, and therefore the effect of the approximation on
the language is limited to lower levels. If we apply the RTN method then we obtain
the language that consists of (grammatical) palindromes of the form $w w^R$, where $w \in
\{\epsilon, a, b\} \cup \{a, b\}^2 \cup \{a, b\}^3$, plus (possibly ungrammatical) strings of the form $w v w^R$, where
$w \in \{a, b\}^3$ and $v \in \{a, b\}^*$. ($w^R$ indicates the mirror image of $w$.)
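
The language just described is easy to test directly, which makes the effect of the transformation concrete. The following Python sketch is a worked example written for this description only; the function names are hypothetical, and the checker for the original palindrome language assumes the grammar S → a S a | b S b | ε.

```python
def is_grammatical(s):
    """Membership in the original language: even-length palindromes over a and b."""
    return set(s) <= {"a", "b"} and len(s) % 2 == 0 and s == s[::-1]


def in_rtn_approximation(s):
    """Membership in the approximated language for three unfolded levels."""
    if not set(s) <= {"a", "b"}:
        return False
    half, odd = divmod(len(s), 2)
    if not odd and half <= 3 and s[:half] == s[half:][::-1]:
        return True  # w w^R with |w| <= 3: a grammatical palindrome
    if len(s) >= 6:
        return s[-3:] == s[:3][::-1]  # w v w^R with |w| = 3 and arbitrary v
    return False


# "abaabaaba" has odd length, so it is ungrammatical, yet it is of the form w v w^R.
EXAMPLE = "abaabaaba"
```

Every grammatical string passes the check, since a palindrome of length at least six has a mirrored prefix and suffix of length three, so the approximation remains a superset.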

The grammar transformation in its full generality is given by the following, which is
to be applied for fixed integer $j > 0$, which is a parameter of the transformation, and for
each $N_i$ such that recursive($N_i$) = self.

For each nonterminal $A \in N_i$ we introduce $j$ new nonterminals $A[1], \ldots, A[j]$. For
each $A \to X_1 \cdots X_m$ in $P$ such that $A \in N_i$, and $h$ such that $1 \le h \le j$, we add
$A[h] \to X'_1 \cdots X'_m$ to $P$, where for $1 \le k \le m$:

$X'_k = X_k[h+1]$, if $X_k \in N_i \wedge h < j$
$X'_k = X_k$, otherwise

Further, we replace all rules $A \to X_1 \cdots X_m$ such that $A \notin N_i$ by $A \to X'_1 \cdots X'_m$, where
for $1 \le k \le m$:

$X'_k = X_k[1]$, if $X_k \in N_i$
$X'_k = X_k$, otherwise

If the start symbol $S$ was in $N_i$, we let $S[1]$ be the new start symbol.
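
The transformation can be sketched in a few lines of Python. This is a minimal illustration for a single set N_i, with assumed names and an assumed rule representation (pairs of a left-hand side and a tuple of symbols); it reads the second replacement as applying to rules whose left-hand side lies outside N_i, which is what the palindrome example suggests, since the original rules for S are retained there.

```python
def unfold(rules, n_i, j, start):
    """Copy the rules for the nonterminals in n_i to j non-recursive levels.

    Returns the new rule list and the new start symbol. A copy of A at level h
    is named "A[h]".
    """
    out = []
    for lhs, rhs in rules:
        if lhs in n_i:
            out.append((lhs, rhs))  # the original rules remain for the deeper levels
            for h in range(1, j + 1):
                new_rhs = tuple(f"{x}[{h + 1}]" if x in n_i and h < j else x for x in rhs)
                out.append((f"{lhs}[{h}]", new_rhs))
        else:
            # Rules outside n_i now enter the recursion at level 1.
            out.append((lhs, tuple(f"{x}[1]" if x in n_i else x for x in rhs)))
    return out, (f"{start}[1]" if start in n_i else start)


# Palindrome grammar: S -> a S a | b S b | (empty), unfolded to three levels.
PALINDROME_RULES = [("S", ("a", "S", "a")), ("S", ("b", "S", "b")), ("S", ())]
NEW_RULES, NEW_START = unfold(PALINDROME_RULES, {"S"}, 3, "S")
```

For j = 3 this yields the twelve rules of the palindrome example, with S[1] as the new start symbol.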

A second transformation, which shares some characteristics with the one above, was
presented by Nederhof (1997). One of the earliest papers suggesting such transformations
as a way to increase the precision of approximation is due to Culik II and Cohen (1973),
who however only discuss examples; no general algorithms were defined.

6 Empirical Results 

In this section we investigate empirically how the respective approximation methods be- 
have on grammars of different sizes and how much the approximated languages differ 
from the original context-free languages. This last question is difficult to answer in a 
precise way. Both an original context-free language and an approximating regular lan- 
guage generally consist of an infinite number of strings, and the number of strings that 
are introduced in a superset approximation or that are excluded in a subset approximation
may also be infinite. This makes it difficult to attach numbers to the "quality" of
approximations.

We have opted for a pragmatic approach which does not require investigation of the 
entire infinite languages of the grammar and the finite automata, but that looks at a 
certain finite set of strings that we have taken from a corpus, as discussed below. For 
this finite set of strings we measure the percentage that overlaps with the investigated 
languages. 
Figure 11

The test material. The left-hand curve refers to the construction of the grammar from 332 
sentences, the right-hand curve refers to the corpus of 1000 sentences used as input to the 
finite automata. 



For the experiments we have taken context-free grammars for German, generated 
automatically from an HPSG and a spoken-language corpus of 332 sentences. This corpus 
consists of sentences possessing grammatical phenomena of interest, manually selected 
from a larger corpus of actual dialogues. An HPSG parser was applied on these sentences, 
and a form of context-free backbone was selected from the first derivation that was found. 
(To take the first derivation is as good as any other strategy, given that we have at present 
no mechanisms for relative ranking of derivations.) The label occurring at a node together 
with the sequence of labels at the daughter nodes was then taken to be a context-free 
rule. The collection of such rules for the complete corpus forms a context-free grammar. 
Due to the incremental nature of this construction of the grammar, we can consider the 
subgrammars obtained after processing the first p sentences, where p = 1, 2, 3, ... , 332. 
See Figure 11 (left) for the relation between p and the number of rules of the grammar.
The construction is such that rules have at most two members in the right-hand side. 

As input we consider a set of 1000 sentences, obtained independently from the 332 
sentences mentioned above. These 1000 sentences were found by having a speech recog- 
nizer provide a single hypothesis for each utterance, where utterances come from actual 
dialogues. Figure 11 (right) shows how many sentences of different lengths the corpus
contains, up to length 30. Above length 25, this number quickly declines, but still a fair 
quantity of longer strings can be found, e.g. 11 strings of a length between 51 and 60 
words. In most cases however such long strings are in fact composed of a number of 
shorter sentences. 

Each of the 1000 sentences was input in its entirety to the automata, although
in practical spoken-language systems, often one is not interested in grammaticality of 
complete utterances, but one tries to find substrings that form certain phrases bearing 
information relevant to the understanding of the utterance. We will however not be 
concerned here with the exact way such recognition of substrings could be realized by 
means of finite automata, since this is outside the scope of the present paper. 

For the respective methods of approximation we measured the size of the compact 
representation of the nondeterministic automaton, the number of states and the number 
of transitions of the minimal deterministic automaton, and the percentage of sentences 
that were recognized, in comparison to the percentage of grammatical sentences. For the 
compact representation, we counted the number of lines, which is roughly the sum of 
the numbers of transitions from all subautomata, not considering about three additional 
lines per subautomaton for overhead. 
Table 1

Size of the compact representation and number of states and transitions, for the refined RTN
approximation (Grimley Evans, 1997).

grammar size   compact repr   # states   # transitions
          10            133         11              14
          12            427         17              26
          13          1,139         17              34
          14          4,895         17              36
          15         16,297         17              40
          16         51,493         19              52
          17        208,350         19              52
          18        409,348         21              59
          19      1,326,256         21              61


We have investigated the size of the compact representation because it is reason- 
ably implementation independent, barring optimizations of the approximation algorithms 
themselves that affect the sizes of the subautomata. Where we show that for some method 
there is a sharp increase in the size of the compact representation for a small increase 
in the size of the grammar, this gives us a strong indication how difficult it would be to 
apply the method to much larger grammars. Note that the size of the compact represen- 
tation is a (very) rough indication as to how much effort is involved in determinization, 
minimization, and substituting the subautomata into each other. For determinization and 
minimization of automata, we have applied programs from the FSM library described
by Mohri, Pereira, and Riley (1998). This library is considered to be competitive with
respect to other tools for processing of finite-state machines, and when the programs
cannot determinize or minimize in reasonable time and space some subautomata constructed
by a particular method of approximation, then this can be regarded as an
indication of the impracticality of the method.

We were not able to compute the compact representation for all the methods and
all the grammars. Quite problematic proved to be the refined RTN approximation from
Section 4.2. We were not able to compute the compact representation for any of the automatically
obtained grammars in our collection that were self-embedding. We therefore
eliminated individual rules by hand starting from the smallest self-embedding grammar in
our collection, eventually finding grammars small enough to be handled by this method.
The results are given in Table 1. Note that the size of the compact representation increases
significantly for each additional grammar rule. The sizes of the finite automata,
after determinization and minimization, remain relatively small.



Also problematic was the first approximation from Section 4.4, which was based on
LR parsing following Pereira and Wright (1997). Already for the grammar of 50 rules, we
were not able to determinize and minimize one of the subautomata according to step 4 of 
Section |f we stopped the process after it had become over 600 Megabytes large. Results 
as far as we could obtain them are given in Table 2. Note the sharp increases in the size
of the compact representation, resulting from small increases, from 44 to 47 and from 47 
to 50, in the number of rules, and note an accompanying sharp increase in the size of 
the finite automaton. For this method, we see no possibility to accomplish the complete 
approximation process, including determinization and minimization, for grammars in our 
collection that are substantially larger than 50 rules. 

Since no grammars of interest could be handled by them, the above two methods 
will be further left out of consideration. 

Table 2

Size of the compact representation and number of states and transitions, for the superset
approximation based on LR automata following Pereira and Wright (1997).

grammar size   compact repr   # states   # transitions
          35         15,921        350           2,125
          44         24,651        499           4,352
          47        151,226      5,112          35,754
          50        646,419          ?               ?

In the sequel, we refer to the unparameterized and parameterized approximations
based on RTNs (Section 4.1) as 'RTN' and 'RTNd,' respectively, for d = 2, 3; to the
subset approximation from Figure 9 as 'Subd,' for d = 1, 2, 3; and to the second and
third methods from Section 4.4, which were based on LR parsing following Baker (1981)
and Bermudez and Schimpf (1990), as 'LR' and 'LRd,' respectively, for d = 2, 3. We refer
to the subset approximation based on left-corner parsing from Section 4.5 as 'LCd,' for
the maximal stack height of d = 2, 3, 4; and to the methods discussed in Section 4.6 as
'Unigram,' 'Bigram' and 'Trigram.'

We first discuss the compact representation of the nondeterministic automata. In
Figure 12 we use two different scales to be able to represent the large variety of values.
For the method Subd, the compact representation is of purely theoretical interest for
grammars larger than 156 rules in the case of Sub1, for those larger than 62 rules in the
case of Sub2, and for those larger than 35 rules in the case of Sub3, since the minimal
deterministic automata could thereafter no longer be computed with a reasonable bound
on resources; we stopped the processes after they had consumed over 400 Megabytes. For
LC3, LC4, RTN3, LR2 and LR3, this was also the case for grammars larger than 139,
62, 156, 217 and 156 rules, respectively. The sizes of the compact representation seem to
grow moderately for LR and Bigram, in the upper panel, yet the sizes are much larger
than those for RTN and Unigram, which are indicated in the lower panel.

The numbers of states for the respective methods are given in Figure 13, again using
two very different scales. As in the case of the grammars, the terminals of our finite 
automata are parts of speech rather than words. This means that in general there will 
be nondeterminism during application of an automaton on an input sentence due to lex- 
ical ambiguity. This nondeterminism can be handled efficiently using tabular techniques 
provided the number of states is not too high. This consideration favours methods which 
produce low numbers of states, such as Trigram, LR, RTN, Bigram and Unigram. 
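
The handling of lexical ambiguity can be pictured as follows: instead of a single current state, the recognizer keeps the set of states reachable under any choice of parts of speech for the words read so far. The Python sketch below is a minimal illustration, not the tabular technique used in the experiments; the toy automaton, the tag names and the function names are all hypothetical.

```python
def accepts(delta, start, finals, sentence):
    """Run a deterministic automaton on input with ambiguous parts of speech.

    delta: dict mapping (state, tag) to the next state.
    sentence: list of sets, each holding the candidate tags of one word.
    """
    states = {start}
    for tags in sentence:
        states = {delta[(q, t)] for q in states for t in tags if (q, t) in delta}
        if not states:
            return False
    return bool(states & finals)


def percentage_recognized(delta, start, finals, corpus):
    """Percentage of the sentences in the corpus that the automaton recognizes."""
    hits = sum(1 for sentence in corpus if accepts(delta, start, finals, sentence))
    return 100.0 * hits / len(corpus)


# Toy automaton over tags: (det) noun verb.
DELTA = {(0, "det"): 1, (1, "noun"): 2, (0, "noun"): 2, (2, "verb"): 3}
CORPUS = [
    [{"det"}, {"noun", "verb"}, {"verb", "noun"}],  # some reading is accepted
    [{"verb"}],                                     # no reading is accepted
]
```

The work per word is bounded by the number of states times the number of candidate tags, which is why automata with few states are preferable in this setting.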

Note that the numbers of states for LR and RTN differ only very little. In fact, 
for some of the smallest and for some of the largest grammars in our collection, the 
resulting automata were identical. Note however that the intermediate results for LR
(Figure 12) are much larger. It should therefore be concluded that the "sophistication"
of LR parsing is here merely a source of needless inefficiency.

The numbers of transitions for the respective methods are given in Figure 14. Again
note the different scales used in the two panels. The numbers of transitions roughly 
correspond to the storage requirements for the automata. It can be seen that again 
Trigram, LR, RTN, Bigram and Unigram perform well. 

The precision of the respective approximations is measured in terms of the percentages
of sentences in the corpus that are recognized by the automata, in comparison to the
percentage of sentences that are generated by the grammar, as presented by Figure 15.
The lower panel represents an enlargement of a section from the upper panel. Methods 
that could only be applied for the smaller grammars are only presented in the lower 
panel; LC4 and Sub2 have been omitted entirely. 
Figure 12

Size of the compact representation. 



The curve labelled G represents the percentage of sentences that are generated by 
the grammar. Note that since all approximation methods compute either supersets or 
subsets, it cannot occur that a particular automaton both recognizes some ungrammatical 
sentences and rejects some grammatical sentences. 
Figure 13

Number of states of the determinized and minimized automata. 



Unigram and Bigram recognize very high percentages of ungrammatical sentences.
Much better results were obtained for RTN. The curve for LR would not be distinguishable
from that for RTN in the figure, and is therefore omitted. (For only two of the
investigated grammars was there any difference, the largest difference occurring for grammar
size 217, where 34.1 versus 34.5 percent of sentences were recognized in the cases
of LR and RTN, respectively.) Trigram remains very close to RTN (and LR); for some
grammars a lower percentage is recognized, for others a higher percentage is recognized.
LR2 seems to improve slightly over RTN and Trigram, but only for small grammars is
data available, due to the difficulty of applying the method to larger grammars. A more
substantial improvement is found for RTN2. Even smaller percentages are recognized by
LR3 and RTN3, but again, only for small grammars is data available.

Figure 14

Number of transitions of the determinized and minimized automata.

Figure 15

Percentage of sentences that are recognized.

The subset approximations LC3 and Sub1 remain very close to G, but also here only
data for small grammars is available, since these two methods could not be applied on
larger grammars. Although application of LC2 on larger grammars required relatively few
resources, the approximation is very crude: only a small percentage of the grammatical
sentences are recognized.

Figure 16

Number of states and number of transitions of the determinized and minimized automata.

We also performed experiments with the grammar transformation from Section 5,
in combination with the RTN method. We found that for increasing j, the intermediate
automata soon became too large to be determinized and minimized, with a bound on
the memory consumption of 400 Megabytes. The sizes of the automata that we were
able to compute are given in Figure 16. 'RTN+j,' for j = 1, 2, 3, 4, 5, represents the
(unparameterized) RTN method in combination with the grammar transformation with
parameter j. This is not to be confused with the parameterized 'RTNd' method.

Figure 17 indicates the number of sentences in the corpus that are recognized by
an automaton divided by the number of sentences in the corpus that are generated by
the grammar. For comparison, the figure also includes curves for RTNd, where d = 2, 3
(cf. Figure 15). We see that j = 1, 2 has little effect. For j = 3, 4, 5, however, the
approximating language becomes substantially smaller than that in the case of RTN, but
at the expense of large automata. In particular, if we compare the sizes of the automata
for RTN+j in Figure 16 with those for RTNd in Figures 13 and 14, then Figure 17
suggests the large sizes of the automata for RTN+j are not compensated adequately by
a reduction of the percentage of sentences that are recognized. RTNd seems therefore
preferable over RTN+j.



7 Conclusions 

If we apply the finite automata with the intention of filtering out incorrect sentences, 
for example from the output from a speech recognizer, then it is allowed that a certain 
percentage of ungrammatical input is recognized, since this merely makes filtering less 
effective, but does not affect the functionality of the system as a whole, provided we 
assume that the grammar specifies exactly the set of sentences that can be successfully 
handled by a subsequent phase of processing. It is also acceptable that "pathological"
grammatical sentences that seldom occur in practice are rejected; examples are sentences
requiring multiple levels of self-embedding.
Of the methods we considered that may lead to rejection of grammatical sentences,
i.e. the subset approximations, none seems of much practical value. The most serious 
problem is the complexity of the construction of automata from the compact representa- 
tion for large grammars. Since the tools we used for obtaining the minimal deterministic 
automata are considered to be of high quality, it is doubtful that alternative implemen- 
tations could succeed on much larger grammars, also considering the sharp increases in 
the sizes of the automata for small increases in the size of the grammar. Only LC2 could 
be applied with relatively few resources, but this is a very crude approximation, which 
leads to rejection of many more sentences than just those requiring self-embedding. 

Similarly, some of the superset approximations are not applicable to large grammars 
because of the high costs of obtaining the minimal deterministic automata. Some oth- 
ers provide rather large languages, and therefore do not allow very effective filtering of 
ungrammatical input. One method however seems to be excellently suited for large grammars,
viz. the RTN method; both the unparameterized version and the parameterized
version with d = 2 come into consideration. In both cases, the size of the automaton grows
moderately in the grammar size. For the unparameterized version, also the compact 
representation grows moderately. Furthermore, the percentage of recognized sentences 
remains close to the percentage of grammatical sentences. It seems therefore that, under 
the conditions of our experiments, this method is the most suitable regular approximation 
that is presently available. 